-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft ProForma implementation #37
Conversation
…o feature/proforma
…Flesh out the Generic modification resolver;
@levitsky This should finally be ready for review, with the fundamental functionality all in place. This adds the The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite. There is still some documentation to iron out, especially which "implementation level" this counts as, as it implements everything but inter-peptide cross-linking support. There's also how to make the users aware of how to control how additional controlled vocabularies are loaded. Right now it uses Unimod directly from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is huge, thank you!
I left some questions/comments in the code. I don't have any use cases, but I was able to catch a couple of issues by trying to parse examples from ProForma README. We can try copying those into tests and adding a psims
install to the GA workflow.
Otherwise, my only real concern is formula parsing. I left a comment about it in the code, too.
Thank you once again for the awesome work.
pyteomics/proforma.py
Outdated
from pyteomics.auxiliary import PyteomicsError, BasicComposition | ||
from pyteomics.auxiliary.utils import add_metaclass | ||
|
||
# To eventually be implemented with pyteomics port? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything you don't like about this dependency?
load_psimod = partial(_needs_psims, 'PSIMOD') | ||
load_xlmod = partial(_needs_psims, 'XLMOD') | ||
load_gno = partial(_needs_psims, 'GNO') | ||
obo_cache = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name does not seem to be used, is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name gets baked into the partial
-made function so that the error is clear about which "entity" depended upon the other source, e.g.
>>> load_psimod()
ImportError: Loading PSIMOD requires the `psims` library. To access it, please install `psims`
Technically, we could just make the message "Loading this controlled vocabulary requires psims
." and be done, but it feels less explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, this comment of mine referred to obo_cache
only. It is not used in pyteomics
code. Github displays four lines for context when the last one is the one I put the comment on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarifying. Yes, that variable is imported to allow proforma
to expose control over the default file cache from psims
. That way users could set the cache directory or disable the cache altogether if they wished. Otherwise they'd need to explicitly import it from psims
. It probably isn't necessary to import it here and just needs to be documented clearly that it interacts with the psims
cache mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Indeed, I think it's worth mentioning in the docs because the user may not even know that psims
is used, or how it works with caches.
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>
Thank you for catching those leftover items. Compliance levels:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for solving the isotope parsing issue and laying out the supported features, this is very helpful. (Also for all the smaller edits.)
I noticed another couple of minor issues with the help of pyflakes.
Other than that, one question/suggestion I have is if you would agree with naming the entry level function just parse
, for the sake of brevity and consistency (see parser.parse
and all the read
functions).
|
||
def __getitem__(self, i): | ||
if isinstance(i, slice): | ||
props = self.properties.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
props = self.properties.copy() |
load_psimod = partial(_needs_psims, 'PSIMOD') | ||
load_xlmod = partial(_needs_psims, 'XLMOD') | ||
load_gno = partial(_needs_psims, 'GNO') | ||
obo_cache = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Indeed, I think it's worth mentioning in the docs because the user may not even know that psims
is used, or how it works with caches.
At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge. |
Thank you. There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group. |
I've updated the documentation on |
This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. I worked to avoid adding dependencies here by making additional controlled vocabularies optional unless you try to parse a string that uses them, and then load them lazily from
psims
.It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.
The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.